LightMem: Lightweight and efficient memory-augmented generation
LightMem proposes a new, human-inspired memory system for large language model (LLM) agents that keeps long-term context while drastically cutting token usage, API calls, and latency compared with existing memory frameworks.
Research topic and objective
The paper studies how to design an external memory system for LLM agents that is both accurate and computationally efficient during long, multi-turn interactions.
Its main objective is to introduce LightMem, a three-stage memory architecture inspired by human sensory, short-term, and long-term memory, and to show that it improves question-answering accuracy while sharply reducing cost on long-context benchmarks.
Key findings and conclusions
- LightMem consistently achieves higher accuracy than strong memory baselines (A-Mem, MemoryOS, Mem0, LangMem) on the LongMemEval and LoCoMo dialogue benchmarks, using both GPT-4o-mini and Qwen3-30B backbones.
- At the same time, it reduces total token usage by up to 38× (GPT) and 21.8× (Qwen), and cuts API calls by up to 30× and 17.1× respectively on LongMemEval; online test-time savings are even larger, reaching over 100× fewer tokens and hundreds of times fewer calls.
- On LoCoMo, LightMem improves accuracy by up to about 18 percentage points for GPT and around 29 percentage points for Qwen compared to memory baselines, while reducing total tokens by up to about 20.9× and API calls by up to about 55.5×.
- The authors conclude that a human-memory-inspired pipeline (early compression, topic-aware grouping, and offline "sleep-time" consolidation) can simultaneously enhance long-horizon reasoning and efficiency for LLM agents.
Critical data and facts
Architecture and mechanism
- LightMem has three main modules that mimic human memory stages:
- Light1 (sensory memory): A pre-compression module uses LLMLingua-2 (or a similar compressor) to filter out low-value tokens from each turn, keeping only the most informative ones based on token-level retention probabilities and entropy measures.
- Light2 (short-term memory, STM): Compressed content is buffered and segmented into topic-coherent groups using a hybrid method that combines attention patterns and semantic similarity between dialogue turns to find topic boundaries, then summarized by an LLM once a token threshold is reached.
- Light3 (long-term memory, LTM): Summaries, embeddings, and raw turns are stored as memory entries; new entries are inserted via "soft" updates at test time, while heavier reorganization, deduplication, and abstraction are deferred to an offline "sleep-time" phase that runs parallel, batched update operations.
- Memory entries store a topic label, an embedding of the summary, and the user/model turns, enabling semantic retrieval and later consolidation with time-aware update queues that only allow newer entries to update older ones.
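The three stages above can be sketched in a few dozen lines. This is a minimal, runnable illustration, not the authors' implementation: the class names, the tiny top-k compressor (a stand-in for LLMLingua-2's retention probabilities), the string-join "summary" (a stand-in for an LLM summarizer), and the threshold values are all assumptions for demonstration.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    topic: str
    summary: str       # stand-in for the LLM summary (and its embedding)
    raw_turns: list    # compressed user/model turns
    timestamp: int     # time-aware: only newer entries may update older ones

class LightMemSketch:
    def __init__(self, stm_threshold=5, ratio=0.5):
        self.stm_threshold = stm_threshold   # STM token budget th
        self.ratio = ratio                   # compression ratio r
        self.stm_buffer, self.stm_tokens = [], 0
        self.ltm, self.clock = [], 0

    def precompress(self, tokens, retain_probs):
        """Light1 (sensory): keep the top-r fraction of tokens by
        retention probability, preserving their original order."""
        k = max(1, int(len(tokens) * self.ratio))
        keep = sorted(range(len(tokens)),
                      key=lambda i: retain_probs[i], reverse=True)[:k]
        return [tokens[i] for i in sorted(keep)]

    def add_turn(self, tokens, retain_probs, topic="dialogue"):
        """Light2 (STM): buffer compressed turns; summarize at threshold."""
        compressed = self.precompress(tokens, retain_probs)
        self.stm_buffer.append(compressed)
        self.stm_tokens += len(compressed)
        if self.stm_tokens >= self.stm_threshold:
            self._flush(topic)

    def _flush(self, topic):
        """Light3 (LTM): soft, append-only insert; deduplication and
        abstraction are deferred to the offline sleep-time phase."""
        self.clock += 1
        summary = " ".join(t for turn in self.stm_buffer for t in turn)
        self.ltm.append(MemoryEntry(topic, summary, self.stm_buffer, self.clock))
        self.stm_buffer, self.stm_tokens = [], 0

mem = LightMemSketch(stm_threshold=5)
mem.add_turn(["plan", "a", "trip", "to", "kyoto", "in", "april"],
             [0.9, 0.1, 0.8, 0.1, 0.95, 0.2, 0.7])
mem.add_turn(["book", "the", "ryokan", "near", "gion"],
             [0.9, 0.1, 0.95, 0.3, 0.8])
print(len(mem.ltm), mem.ltm[0].summary)  # 1 plan trip kyoto book ryokan
```

The key design point the sketch preserves is that the expensive summarization step fires only when the compressed STM buffer crosses the threshold, not once per turn.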
Complexity and efficiency gains
- In a dialogue with N turns and an average of T tokens per turn, conventional systems typically require O(N) summarization calls and updates, whereas LightMem reduces the number of calls to roughly N·r^x·T/th, where r is the compression ratio, x the number of compression iterations, and th the STM buffer threshold.
- This design changes the runtime complexity of memory construction from O(N) to O(N·r^x·T/th), explaining the observed large reductions in API calls and tokens.
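A quick worked example makes the estimate concrete. The numbers here are illustrative choices, not taken from the paper:

```python
# N turns of T tokens each, compression ratio r applied x times,
# STM buffer threshold th (all toy values).
N, T, r, x, th = 500, 1000, 0.5, 1, 2000

baseline_calls = N                      # O(N): one summarization per turn
lightmem_calls = N * (r ** x) * T / th  # O(N * r^x * T / th)

print(baseline_calls, lightmem_calls)   # 500 vs 125.0 summarization calls
```

With these values, a single round of 2× compression plus a 2000-token buffer already cuts summarization calls by 4×; stronger compression (smaller r, larger x) or a larger buffer widens the gap further.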
Benchmark results (selected)
LongMemEval (GPT-4o-mini backbone):
- The strong baseline A-Mem reaches about 62.6% accuracy, while LightMem configurations reach up to about 68.6%, a gain of roughly 2.1–6.4 percentage points over the best baseline.
- Compared with baselines, LightMem reduces total token usage by about 10×–38×, reduces API calls by about 3.6×–30×, and speeds up runtime by roughly 2.9×–12.4× when counting both online and offline phases.
- When counting only online test-time cost, LightMem cuts tokens by up to about 105.9× and API calls by up to about 159.4× relative to other memory systems.
LongMemEval (Qwen3-30B backbone):
- LightMem improves accuracy by up to about 7.67 percentage points over A-Mem, with configurations reaching about 70.2% accuracy.
- It reduces total tokens by around 6.9×–21.8× and API calls by roughly 3.3×–17.1×, with runtime speedups of about 1.6×–6.3×.
LoCoMo (GPT-4o-mini backbone):
- Memory baselines such as A-Mem, MemoryOS, and Mem0 generally reach accuracies in the mid-50s to mid-60s, while LightMem variants reach around 70–73%, for gains of roughly 6.1–18.1 percentage points.
- LightMem reduces total tokens by about 2.87×–20.92×, reduces API calls by about 13.29×–39.78×, and speeds up runtime by around 2.63×–8.21×.
LoCoMo (Qwen3-30B backbone):
- LightMem configurations achieve around 71–73% accuracy, exceeding baselines by roughly 4.4–29.3 percentage points.
- It cuts total tokens by about 3.33×–18.02×, API calls by about 12.96×–55.48×, and runtime by around 1.18×–5.57×.
Analyses of submodules
- Pre-compression: Compressing to 50–80% of the original tokens on LongMemEval preserves QA performance close to uncompressed input while drastically reducing tokens; the compression model runs in under 2 GB of GPU memory and adds negligible overhead.
- Topic segmentation: The hybrid attention-plus-similarity method achieves over 80% segmentation accuracy against ground-truth session boundaries and outperforms attention-only or similarity-only variants.
- Ablation: Removing topic segmentation slightly improves efficiency but drops QA accuracy by about 6.3 percentage points (GPT) and 5.4 percentage points (Qwen), showing its importance for preserving semantic units.
- STM threshold: Larger STM thresholds consistently improve efficiency (fewer calls, less token usage) but affect accuracy non-monotonically; the optimal threshold depends on the compression ratio and model, reflecting a trade-off between cost and performance.
- Sleep-time update: Soft, append-only updates at test time avoid irreversible information loss from mishandled real-time edits, while offline parallel updates use similarity queues and timestamps to reconcile and consolidate memories with low latency.
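To make the segmentation idea concrete, here is a sketch of only the semantic-similarity half of the hybrid method: place a topic boundary wherever consecutive turn embeddings fall below a cosine-similarity threshold. The attention-pattern signal the paper combines with this is omitted, and the 2-D embeddings and 0.5 threshold are toy values, not the paper's.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def segment(turn_embeddings, threshold=0.5):
    """Group turn indices into topic-coherent segments: start a new
    segment when adjacent turns are less similar than the threshold."""
    groups, current = [], [0]
    for i in range(1, len(turn_embeddings)):
        if cosine(turn_embeddings[i - 1], turn_embeddings[i]) < threshold:
            groups.append(current)
            current = [i]
        else:
            current.append(i)
    groups.append(current)
    return groups

# Two turns on one topic, then a sharp topic shift:
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(segment(embs))  # [[0, 1], [2]]
```

The segmentation-accuracy figure above suggests why the hybrid variant matters: a pure similarity threshold like this one misses boundaries where wording stays similar but the dialogue's attention structure shifts.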